Efficiency and Stability of Clustering Algorithms for Linked Data

Authors

  • Isabel Drost
  • Tobias Scheffer
Abstract

We are interested in finding clusters (“communities”) in networks of linked data, such as citation networks or web pages. Hierarchical clustering for networks is reviewed and an algorithmic improvement that leads to a significant performance increase is introduced. Our main focus is on the development of partitioning clustering algorithms that can deal with data represented only by link information (e.g., documents represented only by their citations) and the development of an EM algorithm for such data. A desirable property of clustering is stability; that is, small changes to the data should not lead to dramatically different clusterings. In our experiments with citation data we compare the hierarchical and partitioning clustering algorithms in terms of efficiency, stability and intra-cluster similarity.

The problem of mining linked data (e.g., [4]) has become quite important as more and more information, such as scientific publications or simple web pages, is made available online. The most popular link mining tasks concentrate on finding communities in citation data [9] or in web pages from link topology [5]. Identification of terrorist networks [1; 8] and of fraud in telecommunication networks [2] are among the relevant applications that motivate research in this field.

Clustering is an elementary data analysis step that is well examined for traditional machine learning settings and is now also being applied to linked data. When analysing linked data, it is natural to represent each node of the network by its inbound links, by its outbound links, or by both. One can distinguish hierarchical agglomerative [6] and flat, partitioning clustering algorithms (e.g., [3]). Hierarchical clustering algorithms require a distance metric between pairs of instances to be defined, whereas k-means and EM with mixture models require the instances to be represented as vectors in a feature space.
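The two requirements just described can be illustrated with a small Python sketch. It builds a binary link vector per node (the kind of input k-means or EM would need) and a distance on link sets (the kind of input agglomerative clustering would need). The toy citation graph, the function names, and the choice of Jaccard distance are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy citation graph: (citing, cited) pairs.
EDGES: List[Tuple[str, str]] = [
    ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"),
    ("e", "f"),
]

def outbound_links(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each node to the set of nodes it links to."""
    links: Dict[str, Set[str]] = {}
    for src, dst in edges:
        links.setdefault(src, set()).add(dst)
        links.setdefault(dst, set())  # nodes without outbound links still appear
    return links

def link_vector(node: str, links: Dict[str, Set[str]],
                vocabulary: List[str]) -> List[int]:
    """Binary feature vector: 1 if `node` links to vocabulary[i], else 0."""
    return [1 if target in links[node] else 0 for target in vocabulary]

def jaccard_distance(x: Set[str], y: Set[str]) -> float:
    """1 - |x & y| / |x | y|; taken to be 0.0 when both sets are empty."""
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)

links = outbound_links(EDGES)
vocabulary = sorted(links)  # a fixed node order defines the feature space

vec_a = link_vector("a", links, vocabulary)          # vector view of node "a"
dist_ab = jaccard_distance(links["a"], links["b"])   # identical citations
dist_ae = jaccard_distance(links["a"], links["e"])   # no shared citations
```

Here nodes "a" and "b" cite exactly the same documents and end up at distance 0, while "a" and "e" share no citations and end up at distance 1. Inbound links could be handled the same way by reversing each edge.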
Until now, linked data has in most cases been clustered with an agglomerative algorithm. Yet this kind of algorithm is known to be very time consuming when clustering large numbers of objects and has been shown to be sensitive to perturbations of the data being clustered [7]. We examine the applicability of partitioning algorithms to linked data and compare their performance in terms of efficiency, stability and intra-cluster similarity to the agglomerative algorithm.

Our contribution is threefold. Firstly, we propose a caching strategy for hierarchical agglomerative clustering that improves its performance for clustering link data. Secondly, we derive an EM clustering algorithm for link data. Thirdly, we compare the clustering algorithms in terms of efficiency, stability and intra-cluster similarity for a publication database.
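One way to make the notion of stability concrete is to cluster the original data and a slightly perturbed copy, then count how often pairs of objects are treated the same way in both results. The sketch below computes the Rand index between two flat clusterings. This section does not specify the stability measure used in the paper, so the Rand index, the function name, and the toy labelings are assumptions made for illustration.

```python
from itertools import combinations
from typing import Mapping

def rand_index(first: Mapping[str, int], second: Mapping[str, int]) -> float:
    """Fraction of object pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two objects
    in the same cluster, or both put them in different clusters. Only
    objects present in both clusterings are compared.
    """
    shared = sorted(set(first) & set(second))
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two shared objects")
    agree = 0
    for u, v in pairs:
        same_in_first = first[u] == first[v]
        same_in_second = second[u] == second[v]
        if same_in_first == same_in_second:
            agree += 1
    return agree / len(pairs)

# Hypothetical cluster labels before and after a small perturbation:
# object "d" moves from cluster 1 to cluster 0.
original = {"a": 0, "b": 0, "c": 1, "d": 1}
perturbed = {"a": 0, "b": 0, "c": 1, "d": 0}

score = rand_index(original, perturbed)
```

A score of 1.0 means the two clusterings agree on every pair. Lower values indicate that the perturbation changed the grouping, which is the behaviour a stable algorithm should avoid for small changes.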




Publication date: 2004